Abstract

A model-based approach to rater reliability for essays read by 
multiple readers is presented. Variation of rater severity (between- 
rater variation) and rater inconsistency (within-rater variation) is
considered in presence of between-examinee variation. An additive 
variance component model is posited and the method of moments 
for its estimation described. The model involves no distributional 
assumptions. Minimum mean squared error estimators of examinees'
true scores and readers' severities are derived. Model diagnostic 
procedures are an integral component of the approach. The methods 
are illustrated on data from standardized educational tests. 

Some key words: mean squared error, reliability, shrinkage estimators,
variance components.

Introduction

Reliability of the scoring process for examinees' responses graded by expert 
raters (readers) is an important concern in educational testing. Growing reliance 
on item-types with non-standard format, such as constructed response items 
and portfolios, requires development of methods for analysis of between-rater 
differences and for adjustment of scores. Traditionally, these two problems have 
been treated without their integration; one method was applied to estimate 
between-rater differences, and another method was used to adjust the scores 
based on these estimated differences. This paper presents a method in which 
the two problems are treated integrally. In particular, schemes for adjustment
of examinee scores are proposed which are not intermediated by the estimates
of characteristics of the individual readers. 

Motivated by generalizability theory (Shavelson and Webb, 1991), we
focus on a population of readers, rather than the specific readers that happen
to have been recruited, and use variance parameters to summarize differences
among the readers. An important rationale for this is to make inferences from
one set of readers, examinees, and test forms applicable to other settings which
can be regarded as draws from the same population. The adjustment schemes
are based on shrinkage estimators (Morris, 1983) of the true scores. They
incorporate information about the readers and take account of the uncertainty
about their characteristics. The main advantages of this approach are model
parsimony, the ability to pool information across administrations of tests, and
applicability to any noninformative assignment design (of readers to essays).

The approach presented here is similar to that of Braun (1988), and extends
it in some aspects, in particular, to multivariate scores and to estimation of true
scores. For brevity, the term 'true score' of a response is used for the expected
score of the response over the readers in the population from which they have
been drawn. Braun (1988) gives a comprehensive review of the literature on
scoring reliability. Linacre (1988) and Lunz, Wright, and Linacre (1990) consider
reader severity as an additional facet of the Rasch model for polytomously scored
items, and estimate the severity of each reader. Rater reliability as a source of
measurement error has been extensively studied in medical applications; see
Landis and Koch (1980), Tanner and Young (1985), and Uebersax (1993).

The approach taken here is applicable to both continuous and ordinal
categorical scales, and it enables direct estimation of both reader characteristics and
examinees' true scores. Approaches based on the classical analysis of variance
(ANOVA) or ordinary regression are oriented toward estimation of the effects
associated with the engaged ('realized') readers and examinees, as opposed to
the hypothesized populations from which they are drawn. Thus, no inference
can be made about a future administration of the same or a similar test form
to examinees drawn from the same population and using readers from the same
population. Also, ANOVA estimation of the effects associated with the readers
and responses is very inefficient when there are a large number of units (readers
and/or examinees) because a considerable amount of information provided by
the other units is not used. The approach implemented in Linacre (1988) is an
adaptation of ANOVA for binary data, and it shares the problems of its classical
counterpart. Uebersax (1993) gives a comprehensive review of other approaches
based on models akin to item response theory.

In a typical situation an item is administered to $I$ examinees of varying
ability, and each response, constructed or performed by an examinee, is scored
$K$ times, at most once by each of $J$ readers. An item is an instruction to write
an essay, solve a problem, perform a task, or the like. An examinee's response
may be documented on paper, computer, videotape, or the like, or observed
during the performance. The scoring scale (the range of possible scores) and
rubric (the correspondence of the ability, skill, or knowledge to the scale scores)
are important components of the item definition. The readers are experts in
the subject area and have received extensive training and instruction about the
rating process. Rating of recorded responses is organized into sessions; each
session consists of one rating of each response.

Usually, the mean rating given by the $K$ readers who were assigned a given
response is adopted as its score. In a simplistic approach the sample
intercorrelations among the $K$ readings (sessions) are used as a measure of agreement of
the readers. In this paper, an approach to reader reliability based on a variance
component model is presented. To motivate the development and to highlight
the deficiencies of some of the established approaches, consider the following
two extreme assignment schemes:

• Each reader is assigned all the essays in a session ($J = K$).

• Each of the $IK$ readings is done by a different reader ($J = IK$).

A pair of readers is said to be consistent if the differences in the scores they 
give to the responses that both of them have rated are constant. Formally, 
declare two readers who have rated not more than one response in common as 
consistent, also. A set of readers is said to be consistent if every pair in the set 
is consistent. 
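
The definition of consistency can be checked mechanically. The following is a minimal sketch, not part of the original method; the function names `is_consistent` and `all_consistent` and the dictionary layout (response identifier mapped to score) are assumptions made for illustration.

```python
def is_consistent(scores_a, scores_b, tol=1e-9):
    """Return True if two readers are consistent in the sense defined above.

    scores_a and scores_b map response identifiers to the score each reader
    gave; only responses rated by both readers are compared.
    """
    common = sorted(set(scores_a) & set(scores_b))
    if len(common) <= 1:
        # Formal convention: at most one shared response counts as consistent.
        return True
    diffs = [scores_a[i] - scores_b[i] for i in common]
    # Consistent readers differ by the same constant on every shared response.
    return max(diffs) - min(diffs) <= tol


def all_consistent(readers, tol=1e-9):
    """readers maps a reader name to that reader's {response: score} dict."""
    names = list(readers)
    return all(
        is_consistent(readers[a], readers[b], tol)
        for idx, a in enumerate(names)
        for b in names[idx + 1:]
    )
```

The tolerance `tol` is only there to absorb floating-point rounding; with integer scores it can be set to zero.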

For a set of consistent readers the sample correlations for the pairs of sessions 
(i.e., readers) in the first assignment scheme are all equal to unity, but in the 
second scheme these correlations may be much smaller. This is disconcerting; 
the sample correlation depends on the assignment design, even though it is 
supposed to be a characteristic of the rating process. 
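
A small numerical illustration of this dependence on the design, with made-up true scores and severities (all numbers below are hypothetical): every reader is perfectly consistent in both schemes, yet the between-session correlation is unity only in the first.

```python
import math


def pearson(x, y):
    """Sample (Pearson) correlation of two equally long score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


alpha = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # hypothetical true scores

# Scheme 1 (J = K = 2): one reader per session, severities 0.5 and -1.0.
s1_first = [a + 0.5 for a in alpha]
s1_second = [a - 1.0 for a in alpha]

# Scheme 2 (J = IK = 12): every reading is done by a different reader,
# each with its own (hypothetical) severity and no inconsistency.
sev_first = [1.5, -1.5, 1.0, -1.0, 0.5, -0.5]
sev_second = [-1.0, 1.5, -0.5, 1.0, -1.5, 0.5]
s2_first = [a + b for a, b in zip(alpha, sev_first)]
s2_second = [a + b for a, b in zip(alpha, sev_second)]

r_scheme1 = pearson(s1_first, s1_second)
r_scheme2 = pearson(s2_first, s2_second)
```

In the second scheme the severities are confounded with the examinees, so they act like noise in the correlation even though no reader is inconsistent.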

Two distinct ways in which readers may differ can be readily recognized.
The readers may vary in their severity; some tend to give higher scores while
others tend to give lower scores. Further, readers may disagree on the relative
merits of the responses; reader A may rate response x higher than response y,
disagreeing with reader B who rates response y higher than response x. Such a
disagreement is referred to as inconsistency, or reader-by-examinee interaction.
Unlike severity, which, in principle, can be corrected by adjusting (calibrating)
the scores, there is no way of adjusting for inconsistency.

It will be shown that the standard approach to score adjustment based on
estimation of reader severity is deficient, because the optimal adjustment
depends not only on the estimated severity but also on the amount of information
about severity; that is, when severity is poorly estimated it should be given
small weight in the adjustment. A direct method for estimation of true scores,
which does not rely on estimates of severity, will be presented.

A conceptually useful way of studying the problem of reader reliability is to
consider the $I \times J$ matrix $\mathbf{Y} = (y_{ij})$ of scores given to response $i$ by reader $j$.
Usually, most entries of this matrix are not observed (e.g., when each response is
rated twice, there are $J - 2$ missing observations in each row of $\mathbf{Y}$). Consistency
corresponds to constant differences between any two columns of $\mathbf{Y}$. In that case,
the ordering of the scores is the same in each column. Departure of the scores
from this pattern corresponds to inconsistency.

Each response need not be rated the same number of times. Let $\kappa_i$ be the
set of sessions in which response $i$ was rated, and $K_i$ be the number of these
sessions ($1 \le K_i \le K$). For instance, when each response is rated once in each
of two sessions, $K_i = 2$ and $\kappa_i = \{1, 2\}$ for all responses $i$.

The readers may be assigned unequal numbers of responses both within and
across the $K$ sessions. It will be assumed that the process by which responses
are assigned to readers is non-informative in each session, as are the sets $\{\kappa_i\}$,
so that the process can be regarded as randomized, subject to the constraint
that no essay be read twice by the same reader. Often there are no systematic
differences among the sessions, and so they can be regarded as interchangeable.
An example to the contrary is likely to arise when, say, the readers are instructed
between two sessions to be more lenient.

The paper is organized as follows. The next section describes a variance
component model for readers' scores. The between-examinee, within-reader,
and between-reader variances are identified as descriptors of the rating process.
In the following section the moment method for estimation of these variances is
described. Then, the rater reliability model is expanded to account for varying
behaviour of the readers across sessions, for consistent differences among the
readers, and for multivariate scores. The following two sections present equations
for calibration of the readers and for near-optimal estimation of the examinees'
true scores using shrinkage estimators. Diagnostic procedures are described in
the next section. A subsequent section contains several examples in which
performance of the proposed adjustment schemes is evaluated. A simulation study is
used for generating standard errors and for assessing the impact of uncertainty
about the variances on score adjustment. The paper is concluded with a summary.

Variance component model

For the realized scores $y_{i,j_{ik}}$ consider the additive model

$$y_{i,j_{ik}} = \alpha_i + \beta_{j_{ik}} + \varepsilon_{i,j_{ik}}, \tag{1}$$

where $j_{ik}$ is the index of the reader who graded the response of examinee
$i = 1, \ldots, I$ in session $k \in \kappa_i$; $\alpha_i$ is the true score of examinee $i$, $\beta_j$ is the
severity of reader $j = 1, \ldots, J$, and $\varepsilon_{ij}$ is the residual term interpretable as a
reader-by-examinee interaction.

Different (disjoint), overlapping, or identical pools of readers may be used in
the sessions. Let $J_k$ be the number of readers used in session $k$, $n_{jk}$ the number
of responses graded by reader $j$ in session $k$, and $n_j$ the total number of responses
graded by reader $j$, that is, $n_j = \sum_{k=1}^{K} n_{jk}$. The total number of readers used
(the size of the reader pool) is denoted by $J$. Note that $\sum_{k=1}^{K} J_k \ge J \ge \max_k J_k$.
Further, let $I_k$ be the number of responses rated in session $k$, and $N$ the total
number of ratings, so that $N = \sum_{k=1}^{K} I_k = \sum_{j=1}^{J} n_j$. A session in which every
response is rated, $I_k = I$, is called a complete session. Rating of an essay usually
consists of a small number ($K = 1$ or 2) of complete sessions.

It is assumed that $\{\alpha_i\}$, $\{\beta_j\}$, and $\{\varepsilon_{ij}\}$ in (1) are mutually independent
random samples with respective means $\mu$, 0, and 0, and variances $\sigma_a^2$, $\sigma_b^2$, and
$\sigma_e^2$. These variances represent variation of examinees' true scores, variation in
reader severity, and reader inconsistency, respectively. Various departures from
the model assumptions can be considered as a component of $\sigma_e^2$.

Given the model in (1), the scores given to the same response by two different
readers have the correlation

$$r_1 = \mathrm{cor}(y_{i,j}, y_{i,j'}) = \frac{\sigma_a^2}{\sigma_a^2 + \sigma_b^2 + \sigma_e^2},$$

while a pair of hypothetical scores given to the same response from two
independent ratings by the same reader have the correlation

$$r_2 = \frac{\sigma_a^2 + \sigma_b^2}{\sigma_a^2 + \sigma_b^2 + \sigma_e^2}.$$

The correlation of the true score with the mean of $K$ scores given for a response
is

$$r_a = \mathrm{cor}(\alpha_i, \bar{y}_{i,\cdot}) = \frac{1}{\sqrt{1 + (\tau_b + \tau_e)/K}}, \tag{2}$$

where $\tau_b = \sigma_b^2/\sigma_a^2$ and $\tau_e = \sigma_e^2/\sigma_a^2$ are the relative variances of severity and
inconsistency, respectively. The correlation of a pair of mean scores in independent
replications of the rating process, equal to $r_a^2$, is often quoted as a measure of the
quality of the rating process. For two complete sessions the sample correlation,

$$r = \frac{\sum_i (y_{i,j_{i1}} - \bar{y}_1)(y_{i,j_{i2}} - \bar{y}_2)}{\sqrt{\sum_i (y_{i,j_{i1}} - \bar{y}_1)^2 \, \sum_i (y_{i,j_{i2}} - \bar{y}_2)^2}}$$

($\bar{y}_k$ is the sample mean of the $I_k$ scores given in session $k$), is considered as an
estimator of $r_1 = 1/(1 + \tau_b + \tau_e)$. Hence $r_a$ could be estimated by a suitable
transformation as

$$\hat{r}_a = \sqrt{\frac{2r}{1 + r}}.$$

Figure 1 summarizes the relationship of these two correlations.

Note that the estimator r does not involve independent scores because sets of 
examinees share the same readers. Also, when the designation of the assignment 
of ratings to sessions is arbitrary, the correlation depends on the selected assign- 
ment. The correlation r is affected by the assignment design and compounds 
variation in severity and inconsistency. Identification of these components is es- 
sential for improved calibration of readers, estimation of examinees' true scores, 
and for informed choice of the assignment design. 
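
The correlations discussed in this section are simple functions of the three variance components. The sketch below evaluates them under model (1); it assumes that the $K$ ratings of a response come from $K$ different readers, and the step-up formula for two sessions is the one that follows from $r_1 = 1/(1 + \tau_b + \tau_e)$ with $K = 2$. The function names are illustrative only.

```python
import math


def rating_correlations(var_a, var_b, var_e, K):
    """Return (r1, r2, r_a) implied by the variance components of model (1).

    var_a, var_b, var_e: examinee, severity, and inconsistency variances.
    K: number of ratings per response, each by a different reader.
    """
    total = var_a + var_b + var_e
    r1 = var_a / total                  # same response, two different readers
    r2 = (var_a + var_b) / total        # same response, same reader twice
    tau_b, tau_e = var_b / var_a, var_e / var_a
    r_a = 1.0 / math.sqrt(1.0 + (tau_b + tau_e) / K)
    return r1, r2, r_a


def r_a_from_two_sessions(r):
    """Step up a between-session correlation r (an estimate of r1) to r_a, K = 2."""
    return math.sqrt(2.0 * r / (1.0 + r))
```

For example, with variances 1.0, 0.25, and 0.75 and $K = 2$, the first function gives $r_1 = 0.5$, and stepping up $r = 0.5$ with the second function reproduces the same value of $r_a$.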

Estimation 

The variance components $\sigma_a^2$, $\sigma_b^2$, and $\sigma_e^2$ are estimated by matching certain
sums of squares with their expectations, and these estimates are substituted
for the true values in the appropriate expression for a correlation or another
quantity.

Define the following statistics:

1. the within-examinee sum of squares,

$$S_E = \sum_{i=1}^{I} \sum_{k \in \kappa_i} (y_{i,j_{ik}} - \bar{y}_{i,\cdot})^2,$$

where $\bar{y}_{i,\cdot} = \sum_{k \in \kappa_i} y_{i,j_{ik}} / K_i$ is the mean score for response $i$;

2. the within-reader sum of squares for session $k$,

$$S_{R,k} = \sum_{i \in (k)} (y_{i,j_{ik}} - z_{j_{ik},k})^2,$$

where the summation is over all responses rated in session $k$, and $z_{j,k}$ is
the mean score given by reader $j$ in session $k$;

3. the total sum of squares,

$$S_{T,k} = \sum_{i \in (k)} (y_{i,j_{ik}} - \bar{y}_k)^2,$$

where $\bar{y}_k = I_k^{-1} \sum_{i \in (k)} y_{i,j_{ik}}$ is the mean score in session $k$.

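The expectations of these statistics, assuming the model in (1), are

$$\begin{aligned}
E(S_E) &= (N - I)(\sigma_b^2 + \sigma_e^2), \\
E(S_{R,k}) &= (I_k - J_k)(\sigma_a^2 + \sigma_e^2), \\
E(S_{T,k}) &= (I_k - 1)(\sigma_a^2 + \sigma_e^2) + \Bigl(I_k - \frac{1}{I_k} \sum_j n_{jk}^2\Bigr) \sigma_b^2.
\end{aligned} \tag{3}$$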

These expectations are linear functions of the variance components. This is
particularly advantageous for the moment matching method described below.
To avoid trivial cases, assume that $K \ge 2$, $J \ge 2$, and $N > I$, that is, at least
one response is rated more than once. When $J_k = 1$ the expectations $E(S_{R,k})$
and $E(S_{T,k})$ coincide. There may be a session with a single reader, $J_k = 1$, but
the total number of readers, $J$, has to be greater than one.

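Matching the statistics $S_E$, $S_R = \sum_k S_{R,k}$, and $S_T = \sum_k S_{T,k}$ with their
expectations leads to a system of three linear equations which has the solution

$$\begin{aligned}
\hat\sigma_b^2 &= \frac{S_T - S_R (N - K)/(N - \sum_k J_k)}{N - \sum_k \sum_j n_{jk}^2 / I_k}, \\
\hat\sigma_e^2 &= \frac{S_E}{N - I} - \hat\sigma_b^2, \\
\hat\sigma_a^2 &= \frac{S_R}{N - \sum_k J_k} - \hat\sigma_e^2.
\end{aligned} \tag{4}$$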

These variance estimates can, in principle, be negative. Such values are ad- 
missible when the variance components are interpreted as certain (conditional) 
covariances. In practice, it is often meaningful to replace them by zeros. 
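
The moment matching can be sketched in a few lines. The code below is an illustration rather than the author's implementation: it assumes the ratings are available as (examinee, reader, session, score) tuples, that no reader rates the same response twice, and that the three moment equations take the form obtained by equating $S_E$, $S_R$, and $S_T$ with $(N - I)(\sigma_b^2 + \sigma_e^2)$, $(N - \sum_k J_k)(\sigma_a^2 + \sigma_e^2)$, and $(N - K)(\sigma_a^2 + \sigma_e^2) + (N - \sum_k \sum_j n_{jk}^2/I_k)\sigma_b^2$. The name `moment_estimates` is hypothetical.

```python
from collections import defaultdict


def moment_estimates(ratings):
    """Method-of-moments estimates (var_a, var_b, var_e) for model (1).

    ratings: iterable of (examinee, reader, session, score) tuples.
    Estimates may be negative; no truncation at zero is applied here.
    """
    by_examinee = defaultdict(list)
    by_cell = defaultdict(list)       # (session, reader) -> scores
    by_session = defaultdict(list)
    for examinee, reader, session, score in ratings:
        by_examinee[examinee].append(score)
        by_cell[(session, reader)].append(score)
        by_session[session].append(score)

    def ss(values):
        """Sum of squared deviations from the mean."""
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values)

    N = sum(len(v) for v in by_session.values())   # total number of ratings
    I = len(by_examinee)                           # number of responses
    K = len(by_session)                            # number of sessions
    sum_Jk = len(by_cell)                          # sum over sessions of J_k
    S_E = sum(ss(v) for v in by_examinee.values())
    S_R = sum(ss(v) for v in by_cell.values())
    S_T = sum(ss(v) for v in by_session.values())
    # Sum over sessions k and readers j of n_jk^2 / I_k.
    q = sum(len(v) ** 2 / len(by_session[s]) for (s, _), v in by_cell.items())

    var_b = (S_T - S_R * (N - K) / (N - sum_Jk)) / (N - q)
    var_e = S_E / (N - I) - var_b
    var_a = S_R / (N - sum_Jk) - var_e
    return var_a, var_b, var_e
```

The sketch does not guard against degenerate designs (for example a single reader in every session, for which the denominator `N - q` is zero), and it returns negative estimates unchanged so that the caller can decide whether to replace them by zeros.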

Extensions



The readers may conduct themselves differently in each session. For example,
the model in (1) can be expanded to allow for session-specific severity of each
reader. Consider the average severity $\beta_j$, and a deviation $\gamma_{j,k}$ of the reader's
severity in session $k$ from the average severity (reader-by-session interaction),
so that each score conforms to the model

$$y_{i,j_{ik}} = \alpha_i + \beta_{j_{ik}} + \gamma_{j_{ik},k} + \varepsilon_{i,j_{ik}}, \tag{5}$$

where $\{\gamma_{j,k}\}$ is a random sample from a distribution with mean 0 and variance
$\sigma_g^2$, independent from the other random variables ($\alpha$, $\beta$, and $\varepsilon$). The variance
$\sigma_g^2$ represents within-reader between-session variation. As a further extension,
variances $\sigma_{g,k}^2$ specific to sessions can be considered. Also, it may be meaningful
to consider session-specific means $\mu_k = E(y_{i,j_{ik}} \mid k)$, and/or inconsistency
variances $\sigma_{e,k}^2$ varying from session to session, so as to accommodate, for instance,
higher inconsistency in the first session.

For illustration, assume the variance $\sigma_g^2$ to be common to all sessions. In
addition to the statistics $S_E$, $S_{R,k}$, and $S_{T,k}$ define the between-session sum of
squares $S_B$ as

[The displayed definition of $S_B$, a double sum over sessions $k$ and readers $j$, is illegible in the source.]

where $z_j$ is the mean of all the scores given by reader $j$;

$$z_j = \frac{1}{n_j} \sum_{i} y_{i,j} = \frac{1}{n_j} \sum_{k} n_{jk} z_{j,k},$$

with the first sum taken over the responses rated by reader $j$.

The expectation of $S_B$, assuming the model in (5), is

[Equation (6) is illegible in the source.]

The expectations of the other sums of squares, $S_E$, $S_{R,k}$, and $S_{T,k}$, are obtained
from (3) by replacing $\sigma_b^2$ with $\sigma_b^2 + \sigma_g^2$ (the equation for $S_{R,k}$ is unchanged).
The estimates of the variances $\sigma_a^2$, $\sigma_b^2$, $\sigma_g^2$, and $\sigma_e^2$ are obtained by solving the
system of four linear equations that match the statistics $S_E$, $S_R$, $S_B$, and $S_T$
with their (theoretical) expectations. The system of equations can be solved
using (4), with $\sigma_b^2$ replaced by $\sigma_b^2 + \sigma_g^2$; the estimate of the sum of the variances
$\sigma_b^2 + \sigma_g^2$ is then decomposed using (6).

Severity of the readers may depend on extraneous factors, such as the time 
of the day. Also, it may be desirable to relate severity to readers' characteristics 
and attributes, such as gender and experience. Such features can be accommodated
by replacing the random terms $\beta_j$ in (5) by a linear regression. If the
factors of interest are categorical, the moment matching method of estimation 
can be supplemented by suitable contrasts of scores which facilitate identifica- 
tion of the effects. 

Equations for other extensions of the proposed model, including models that
accommodate session-specific severity variance and reader inconsistency, are derived
analogously. Such models are likely to be useful only when essays are read
a relatively large number of times (e.g., K > 3 times) by identical or highly 
overlapping pools of readers. 

Multiple criteria 

Readers sometimes score responses to an item for several aspects, such as technical
skill, originality, and presentation. The models and the associated methods
of estimation have straightforward extensions for multivariate scores; the
equations (3) and (6) remain valid, with the variances $\sigma_a^2$, $\sigma_b^2$, $\sigma_e^2$, and $\sigma_g^2$ replaced by
variance matrices $\Sigma_a$, $\Sigma_b$, $\Sigma_e$, and $\Sigma_g$. These matrices are of interest for
exploring relationships among the component scores (between examinees, between
readers, and within readers). There is no obvious extension for the correlations
$r_1$, $r_2$, and $r_a$ to the multivariate case, other than defining the correlations of
the component scores.



Adjustment for severity (calibration) 

Standard approaches to estimating examinees' true scores, exemplified by Braun 
(1988), focus on estimation of readers' severity coefficients which would then 
be used for adjustment of the examinees' scores. The method described here 
estimates true scores without intermediation of the estimated readers' severities. 
The advantage of this method can be readily recognized by considering readers 
assigned extremely small or large workloads. The severity of a reader with 
a small workload is estimated subject to substantial sampling variation, and 
therefore adjustment by this 'noisy' quantity is not advisable. On the other 
hand, adjustment for severity is effective when the severity is well determined. 
Thus, the amount of information about severity of the readers should play an
important role in efficient adjustment.

In the following two sections minimum mean squared error estimators of 
the readers' severities and examinees' true scores are derived. Although in this
approach severity estimates are not necessary for estimation of true scores, they 
are still useful for identifying unusual readers. 

Estimating severity 

When the severity of reader $j$ is realized on several ratings, the realization of $\beta_j$
is estimable. The conditional expectation of $\beta_j$, given the model parameters and
the data, is an obvious estimator of $\beta_j$. Evaluation of this expectation involves
inversion of the variance matrix for all the ratings, a formidable task without
taking advantage of the pattern of the matrix. Complex algebra is involved even
if the pattern is appropriately exploited.

An alternative method can be motivated by shrinkage estimation. Consider
the following two estimators of the realization of $\beta_j$: the trivial estimator
identically equal to zero, and the difference of the mean of the ratings given by reader
$j$ and the mean of all the ratings, $z_j - \bar{y}$. When $\sigma_b^2 = 0$, zero is the optimal
estimator of the realization of $\beta_j$, because the readers do not differ in severity.
When $\sigma_b^2$ is large, or the workloads $n_j$ are large, $z_j - \bar{y}$ is a good estimator of
$\beta_j$. The linear combination of these estimators,

$$\hat\beta_{j,s} = s_j (z_j - \bar{y}),$$

is adopted, with the reader-specific coefficient $s_j$ which minimizes the mean
squared error (MSE), $E(\hat\beta_{j,s} - \beta_j)^2$.

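Denote by $[j,k]$ and $[j]$ the respective sets of responses rated by reader $j$ in
session $k$ and in all the sessions; $[j] = \bigcup_{k=1}^{K} [j,k]$. The model equation (1) implies

$$z_j = \beta_j + \frac{1}{n_j} \sum_{k=1}^{K} \sum_{i \in [j,k]} (\alpha_i + \varepsilon_{ij}),$$

and elementary algebra yields

$$\begin{aligned}
E(\hat\beta_{j,s} - \beta_j)^2 ={}& \sigma_b^2 \Bigl\{ \Bigl(1 - s_j + \frac{s_j n_j}{N}\Bigr)^2 + \frac{s_j^2}{N^2} \sum_{j' \ne j} n_{j'}^2 \Bigr\} \\
&+ s_j^2 \sigma_a^2 \Bigl( \frac{1}{n_j} - \frac{2}{n_j N} \sum_{i \in [j]} K_i + \frac{1}{N^2} \sum_i K_i^2 \Bigr) + s_j^2 \sigma_e^2 \Bigl( \frac{1}{n_j} - \frac{1}{N} \Bigr) \\
={}& C_{j,0} - 2 C_{j,1} s_j + C_{j,2} s_j^2,
\end{aligned} \tag{7}$$

where

$$\begin{aligned}
C_{j,0} &= \sigma_b^2, \qquad C_{j,1} = \sigma_b^2 \Bigl(1 - \frac{n_j}{N}\Bigr), \\
C_{j,2} &= \sigma_b^2 \Bigl(1 - \frac{2 n_j}{N} + \frac{1}{N^2} \sum_{j'} n_{j'}^2\Bigr) + \sigma_a^2 \Bigl( \frac{1}{n_j} - \frac{2}{n_j N} \sum_{i \in [j]} K_i + \frac{1}{N^2} \sum_i K_i^2 \Bigr) + \sigma_e^2 \Bigl( \frac{1}{n_j} - \frac{1}{N} \Bigr).
\end{aligned}$$

Note that these equations contain totals over all examinees ($\sum_i$) and over
the examinees rated by reader $j$ ($\sum_{i \in [j]}$). The former are common to all readers,
but the latter, together with the workloads $n_j$, may vary among the readers,
resulting in different optimal coefficients $s_j$.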

The MSE in (7) has a unique minimum at $s_j = C_{j,1}/C_{j,2}$, and the attained
minimum is $C_{j,0} - C_{j,1}^2/C_{j,2}$. The coefficient $s_j$ can be interpreted as the optimal
shrinkage of the deviation $z_j - \bar{y}$ towards zero. When the readers' workloads do
not vary a great deal, $\max_j n_j \ll N$, and hence $0 < C_{j,1} < C_{j,2}$. Then
$0 < s_j < 1$. Values of $s_j$ close to zero and unity are attained only in unusual
scenarios; for instance, $s_j = 0$ only if $\sigma_b^2 = 0$.

In practice, the variances are not known and their estimates are used instead.
This is problematic in small samples, that is, when the number of readers and/or
examinees is small. The simulations discussed below provide some insight into
sample size issues. When the variances are known, severity of each reader is
estimable even if each response is rated only once. Note how calibration
depends on the reader's load $n_j$; in general, higher load $n_j$ is associated with less
shrinkage toward zero.
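
The dependence of the shrinkage on the workload can be explored numerically. The sketch below computes the coefficient that minimizes $E\{s_j(z_j - \bar{y}) - \beta_j\}^2$ under model (1), using $s_j = \mathrm{cov}(z_j - \bar{y}, \beta_j)/\mathrm{var}(z_j - \bar{y})$ worked out directly from the model; it assumes known variances and that no reader rates a response twice. The function name and argument layout are illustrative, not taken from the paper.

```python
def severity_shrinkage(var_a, var_b, var_e, n_j, workloads, K_all, K_rated):
    """Shrinkage coefficient s_j for the deviation z_j - ybar.

    n_j       : workload of reader j
    workloads : workloads of all readers (their sum is N)
    K_all     : number of ratings K_i of every response
    K_rated   : K_i of the responses rated by reader j (length n_j)
    """
    N = sum(workloads)
    # cov(z_j - ybar, beta_j)
    c1 = var_b * (1.0 - n_j / N)
    # var(z_j - ybar), split by variance component
    c2 = (
        var_b * (1.0 - 2.0 * n_j / N + sum(n * n for n in workloads) / N**2)
        + var_a * (1.0 / n_j
                   - 2.0 * sum(K_rated) / (n_j * N)
                   + sum(k * k for k in K_all) / N**2)
        + var_e * (1.0 / n_j - 1.0 / N)
    )
    return c1 / c2


# Hypothetical design: 100 responses rated twice, ten readers, unequal loads.
workloads = [5, 35] + [20] * 8
K_all = [2] * 100
s_small = severity_shrinkage(1.0, 0.2, 0.5, 5, workloads, K_all, [2] * 5)
s_large = severity_shrinkage(1.0, 0.2, 0.5, 35, workloads, K_all, [2] * 35)
```

In this made-up design the reader with 35 ratings receives a coefficient close to 0.9, while the reader with 5 ratings is shrunk much more strongly (coefficient near 0.4).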

Estimating true scores 

In practice, estimation of a reader's severity is of secondary concern to
estimation of the examinees' true scores $\alpha_i$, although estimation of the coefficients
$\beta_j$ can facilitate this. Simplistic schemes for adjustment of the 'raw' scores
$\bar{y}_{i,\cdot} = K_i^{-1} \sum_{k \in \kappa_i} y_{i,j_{ik}}$ are based on various linear combinations of the mean
score given by reader $j$, $z_j$, and the mean score for all the sessions, $\bar{y}$. Common
examples of such adjustment schemes are

$$\hat\alpha_i^{(1)} = \bar{y}_{i,\cdot} - \frac{1}{K_i} \sum_{k \in \kappa_i} (z_{j_{ik}} - \bar{y}), \qquad
\hat\alpha_i^{(2)} = \bar{y}_{i,\cdot} - \frac{1}{K_i} \sum_{k \in \kappa_i} \hat\beta_{j_{ik}},$$

where $\hat\beta_j$ is an estimator of the realization of $\beta_j$.



In the literature on rater reliability, the discussion about scoring is often
limited to the dilemma of whether to adjust (use $\hat\alpha_i^{(1)}$ or $\hat\alpha_i^{(2)}$), or not (use $\bar{y}_{i,\cdot}$).
A continuum of shrinkage estimators can be defined to fill in the void between
these two extremes. For instance, the estimator

$$\hat\alpha_{i,u} = \bar{y}_{i,\cdot} - \frac{u}{K_i} \sum_{k \in \kappa_i} (z_{j_{ik}} - \bar{y}), \tag{8}$$

with a constant $u$, corresponds to no adjustment when $u = 0$, and to (full)
adjustment by the mean of the means of the readers who rated the response $i$
when $u = 1$. It is shown below that intermediate values of $u$ yield more efficient
estimators. Moreover, different coefficients $u$ can be used for the examinees;
more shrinkage is appropriate for responses rated by readers who had small
workloads because the ratings contain less information about the severity of
their readers.
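
A minimal sketch of this one-parameter family follows. It assumes the estimator has the form "raw mean score minus $u$ times the average deviation of the relevant readers' means from the overall mean", which matches the description of the two extremes $u = 0$ and $u = 1$; the function name and the (examinee, reader, score) layout are assumptions made for illustration.

```python
from collections import defaultdict


def adjusted_scores(ratings, u):
    """Shrinkage-adjusted scores for a common coefficient u.

    ratings: iterable of (examinee, reader, score) tuples.
    Returns a dict mapping each examinee to the adjusted score.
    """
    ratings = list(ratings)
    grand_mean = sum(score for _, _, score in ratings) / len(ratings)
    by_reader = defaultdict(list)
    by_examinee = defaultdict(list)
    for examinee, reader, score in ratings:
        by_reader[reader].append(score)
        by_examinee[examinee].append((reader, score))
    reader_mean = {r: sum(v) / len(v) for r, v in by_reader.items()}

    adjusted = {}
    for examinee, pairs in by_examinee.items():
        raw = sum(score for _, score in pairs) / len(pairs)
        # Average deviation of this response's readers from the overall mean.
        shift = sum(reader_mean[r] - grand_mean for r, _ in pairs) / len(pairs)
        adjusted[examinee] = raw - u * shift
    return adjusted


# Hypothetical data: reader "A" is lenient, reader "B" is severe.
example = [(1, "A", 3.0), (2, "A", 4.0), (3, "B", 1.0), (4, "B", 2.0)]
no_adjustment = adjusted_scores(example, 0.0)
full_adjustment = adjusted_scores(example, 1.0)
half_adjustment = adjusted_scores(example, 0.5)
```

With these made-up ratings, full adjustment maps the four scores to 2, 3, 2, and 3, and $u = 0.5$ moves each raw score halfway there. Examinee-specific coefficients would replace the single argument `u` by a dictionary keyed by examinee.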

Another class of estimators of $\alpha_i$ is given by the equation

$$\hat\alpha_{i,t} = (1 - t)\, \bar{y}_{i,\cdot} + \frac{t}{K_i} \sum_{k \in \kappa_i} z_{j_{ik}}, \tag{9}$$

where $t$ is a constant. It does not have as appealing a motivation as (8); the
estimator adjusts the scores by shifting them closer to the readers' means.
Nevertheless, in several examples analyzed below $\hat\alpha_{i,t}$ performs better than $\hat\alpha_{i,u}$.
The coefficients $u$ in (8) and $t$ in (9) can be set so as to minimize the expected
squared error. Before determining these coefficients, consider a more general
scheme based on the class of estimators formed as the linear combinations of
the statistics $\bar{y}_{i,\cdot}$, $z_j$, and $\bar{y}$:

$$\hat\alpha_i = v_{1i} \bar{y}_{i,\cdot} - \frac{v_{2i}}{K_i} \sum_{k \in \kappa_i} z_{j_{ik}} + v_{3i} \bar{y}, \tag{10}$$

where $v_{hi}$, $h = 1, 2, 3$, are the examinee-specific coefficients, such that

$$v_{1i} - v_{2i} + v_{3i} = 1. \tag{11}$$

This constraint is necessary to ensure unbiasedness, that is, $E(\hat\alpha_i - \alpha_i) = 0$.
The shrinkage coefficients $v_{hi}$ are chosen so as to minimize the MSE for the true
score $\alpha_i$. Although the algebra involved appears tedious, it is elementary.

Let $r_{ii',k}$ be equal to 1 if responses $i$ and $i'$ were graded by the same reader
in session $k$, and equal to 0 otherwise. Further, let $n_i^+ = \sum_{k \in \kappa_i} n_{j_{ik}} / K_i$ and
$n_i^- = \sum_{k \in \kappa_i} n_{j_{ik}}^{-1} / K_i$ [the remaining definitions on this line are illegible in the source].

Note that $1 \le n_i^+ \le I$ and $1 \ge n_i^- \ge 1/I$. The MSE of $\hat\alpha_i$ is

[Equation (12), a quadratic function of $v_{1i}$, $v_{2i}$, and $v_{3i}$ with coefficients linear in $\sigma_a^2$, $\sigma_b^2$, and $\sigma_e^2$, is illegible in the source.]

This is a quadratic function in the coefficients $v_{hi}$. Assuming non-negative
variances $\sigma_a^2$, $\sigma_b^2$, and $\sigma_e^2$, the coefficients of the quadratic terms $v_{hi}^2$ are all
positive. Therefore, (12) has either a unique minimum or a continuum of minima
located on a straight line. Differentiation with respect to the coefficients $v_{hi}$
yields the linear system of three equations

$$\begin{aligned}
v_{1i} A_{11} - v_{2i} A_{12} + K_i v_{3i} A_{13}/N &= K_i \sigma_a^2, \\
v_{1i} A_{12} - v_{2i} A_{22} + K_i v_{3i} A_{13}/N &= K_i n_i^- \sigma_a^2, \\
v_{1i} A_{13} - v_{2i} A_{13} + v_{3i} A_{33} &= K_i \sigma_a^2,
\end{aligned} \tag{13}$$

where

$$\begin{aligned}
A_{11} &= K_i \sigma_a^2 + \sigma_b^2 + \sigma_e^2, \\
A_{12} &= K_i n_i^- \sigma_a^2 + \sigma_b^2 + n_i^- \sigma_e^2, \\
A_{13} &= K_i \sigma_a^2 + n_i^+ \sigma_b^2 + \sigma_e^2.
\end{aligned}$$

[The definitions of $A_{22}$ and $A_{33}$ are illegible in the source.]

The constraint in (11) can be enforced either by substitution, or by application
of the Lagrange multipliers. When the variance of $\bar{y}$ is negligible in comparison
with that of $z_j$ and $\bar{y}_{i,\cdot}$, e.g., when there are few sessions and many readers,
application of the constraint corresponds to adjustment of $v_{3i}$, with negligible
adjustments of $v_{1i}$ and $v_{2i}$. Thus, it suffices to solve the first two equations in
(13) and then set $v_{3i} = 1 - v_{1i} + v_{2i}$. When a shrinkage estimator $\hat\beta_{j,s}$ is used
instead of $z_j$ in (10), these two adjustment schemes coincide because $\hat\beta_{j,s}$ is a
linear function of $z_j$ and $\bar{y}$.

Several problems with the solution of (13) are readily recognized: difficulty
with interpretation of the optimal coefficients $v_{hi}$, lack of any insight into the
dependence of the coefficients on the variance components, sampling variation of
the coefficients, and the amount of reduction of the MSEs over simpler schemes.
These issues are discussed below using several examples.

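For the more restrictive schemes given by (8) and (9) the equations for the
MSE are substantially simpler. For (8),

$$E(\hat\alpha_{i,u} - \alpha_i)^2 = D_{0,i} - 2 D_{1,i} u_i + D_{2,i} u_i^2, \tag{14}$$

where

[The definitions of $D_{0,i}$, $D_{1,i}$, and $D_{2,i}$ are illegible in the source.]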

When $D_{2,i} > 0$ the MSE in (14) has a unique minimum at $u_i^* = D_{1,i}/D_{2,i}$, and
its minimum value is $D_{0,i} - D_{1,i}^2/D_{2,i}$. When the readers' workloads do not vary
substantially, $D_{2,i} > D_{1,i} > 0$, and then $0 < u_i^* < 1$; $u_i^*$ can be interpreted as
a shrinkage coefficient. No adjustment ($u_i^* = 0$) is optimal when $n_{j_{ik}} = I$ for
all $k$ (one reader per session rating all the responses). When each reader rates
every response, $D_{2,i} = D_{1,i} = 0$, and so the MSE in (14) is constant. In that
case adjustment is superfluous.

Note that the examinee variance $\sigma_a^2$ is not involved in $D_{1,i}$. Since the
coefficient of $\sigma_a^2$ in $D_{2,i}$ is positive, higher $\sigma_a^2$ (all else held equal) leads to smaller $u_i^*$
(less shrinkage). On the other hand, higher between-reader (severity) variation
and higher inconsistency are associated with more shrinkage.

In the administration of a typical large-scale testing program there are a large
number of examinees and readers, and each response is rated a small number
of times (once or twice). Each reader has a substantial workload, and so the
scores contain abundant information about his/her severity. If readers' severities
are variable, adjustment with high $u_i$ is likely to be better than no adjustment.
Note that even when $\sigma_b^2 = 0$ the optimal adjustment is for $u_i > 0$. However,
when the workloads $n_j$ are much smaller than the number of examinees $I$, the
coefficient of $\sigma_b^2$ in $D_{1,i}$ is much smaller than that in $D_{2,i}$ (the coefficients of
$\sigma_e^2$ are equal). Therefore, adjustment is more important ($u_i^*$ is larger) the larger
the severity variance $\sigma_b^2$.

The estimators of the true scores $\alpha_i$ based on the coefficients $u_i^*$ break down
in some extreme scenarios. For example, when $\sigma_a^2 = 0$ the optimal estimator of
each examinee's score $\alpha_i$ is the sample mean $\bar{y}$. When the readers are perfectly
consistent, $\sigma_e^2 = 0$, and each pair of readers can be linked through responses
(reader A is linked with reader B if there are readers C, D, E, . . ., Z, such that
there is at least one response each rated by the pairs of readers A and C, C
and D, D and E, . . ., and Z and B), then the true scores $\alpha_i$ can be determined
exactly. In these cases the estimator $\hat\alpha_{i,u}$ is very inefficient.

The estimators given by (9) can be analyzed by the same approach. The
mean squared error is

$$E(\hat\alpha_{i,t} - \alpha_i)^2 = E_{0,i} - 2 E_{1,i} t_i + E_{2,i} t_i^2 \tag{15}$$

for implicitly defined coefficients $E_{k,i}$ [the expanded form of the right-hand side
is only partly legible in the source], and so its unique minimum is attained
at

$$t_i^* = \frac{\sigma_e^2 (1 - n_i^-)}{\sigma_a^2 (K_i - 2 K_i n_i^- + \cdots) + \sigma_e^2 (1 - n_i^-)}, \tag{16}$$

where the dots stand for a term that is illegible in the source.

Clearly $0 < t_i^* < 1$. When the readers are very inconsistent, that is, $\sigma_e^2$ is much
larger than $\sigma_a^2$, $t_i^*$ is close to unity. When $\sigma_e^2$ is much smaller than $\sigma_a^2$, $t_i^*$ is
close to zero, and so the optimal adjustment is minute. This is in agreement
with intuition. It is interesting, though, that the optimal coefficient $t_i^*$ does
not depend on $\sigma_b^2$. This is of some importance because $\sigma_b^2$ is usually the least
precisely estimated variance component.

Diagnostic procedures 

The readers may display any of a whole gamut of behaviours different from that
assumed by the model in (1). In this section, informal procedures for detecting
some types of departure from the model in (1) are discussed.

Homogeneity of the inconsistency deviations $\varepsilon_{ij}$ is an important assumption
in (1). Define the sum of squares within reader $j$ as the subtotal within $S_{R,k}$
corresponding to the reader $j$;

$$S_{R,jk} = \sum_{i \in [j,k]} (y_{i,j} - z_{j,k})^2.$$

Assuming normality of the scores $y_{ij}$, $S_{R,jk}/(\sigma_a^2 + \sigma_e^2)$ has the $\chi^2$ distribution
with $n_{jk} - 1$ degrees of freedom. For large $n_{jk}$ its distribution is approximately
$\chi^2_{n_{jk}-1}$, even when the scores are not normally distributed. The statistics $S_{R,jk}$
can be pooled over sessions $k$, but not over readers because the corresponding
statistics are not independent. Thus, checking for variance homogeneity involves
comparison of the statistics $S_{R,jk}$ or their combinations with the critical values
of the appropriate $\chi^2$ distributions. Unlike in other uses of the $\chi^2$ distribution,
both very large and very small values of the statistics are evidence against the
model. Small values are a sign that the reader is giving almost the same grade
for every response.
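
The two-sided check can be sketched as follows. This is an illustration under stated assumptions: the variances are treated as known, the within-reader statistic is the sum of squared deviations of one reader's scores in one session from their mean, and the $\chi^2$ critical values are obtained from the Wilson-Hilferty normal approximation (a substitute chosen here to keep the example free of external libraries, not something prescribed by the paper). The function names are hypothetical.

```python
from statistics import NormalDist


def chi2_quantile(p, df):
    """Wilson-Hilferty approximation to the chi-squared quantile.

    This is an approximation; it is adequate for moderate to large df.
    """
    z = NormalDist().inv_cdf(p)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * c ** 0.5) ** 3


def within_reader_check(scores, var_a, var_e, level=0.05):
    """Two-sided check of one reader's within-session sum of squares.

    scores: the scores given by one reader in one session (at least two).
    Returns the scaled statistic and a verdict string.
    """
    n = len(scores)
    mean = sum(scores) / n
    stat = sum((s - mean) ** 2 for s in scores) / (var_a + var_e)
    lower = chi2_quantile(level / 2.0, n - 1)
    upper = chi2_quantile(1.0 - level / 2.0, n - 1)
    if stat < lower:
        return stat, "too small"   # reader gives nearly the same grade throughout
    if stat > upper:
        return stat, "too large"   # reader's scores are more dispersed than expected
    return stat, "ok"
```

In practice the variances would be replaced by their estimates, which adds uncertainty that this sketch ignores.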

A finer insight, relevant for large-scale tests, is enabled by considering
responses rated by the same pair of readers. The variance of the differences
$y_{i1} - y_{i2}$ for such a set of responses is $2\sigma_e^2$. The corresponding (within-pair)
sums of squares can be pooled, because they are independent, thus generating
statistics which can be compared with the critical values of the corresponding $\chi^2$
distributions. Readers who tend to disagree with their fellow readers, or agree
with them, more than would be expected can be identified from these statistics.

These two sets of statistics imply a general diagnostic method based on
defining suitable subtotals of the sums of squares $S_E$, $S_R$, $S_T$, and others, if
applicable, which have conditionally independent summands. If the number
of terms in these totals is moderate to large, the totals are approximately $\chi^2$
distributed. The sums of squares selected should reflect the principal concerns
about model violations.

These diagnostic procedures rely heavily on the assumption of non-informative
allocation of responses to readers. For instance, it is difficult to distinguish
a reader who was assigned lower quality responses from a reader with
high severity, in particular when each response is rated a small number of times.

Reliability 

In the approach presented here the mean squared error appears to be a more
natural metric for assessment of the rating process than the traditionally applied
correlation coefficients, such as $r_a$ and $r_1$. Counterparts of these correlations
can be defined for adjusted scores, for instance, by replacing the unadjusted
score in (2) with the adjusted score.

For the optimal shrinkage coefficients u* and t* these correlations are: 

ra.iu = cor(ai,di,„.) = / = I ^ " "* f ^ " ^) 1 

ra.it = cor(ai,di.,.) = , ° (17) 

y^o,f - E\ JE2,i 

where $E_{0,i}$, $E_{1,i}$, and $E_{2,i}$ are the absolute, linear, and quadratic coefficients of $t_i$
in the right-hand side of (15). Note that unlike $r_a$, the correlations $r_{a,iu}$ and $r_{a,it}$
are not constant across the responses i, unless a balanced assignment design is
employed. Equation (17) implies that for every response i the correlation $r_{a,it}$ is
greater than $r_a$, but $r_{a,iu}$ may not be. The counterparts of $r_1$ are the correlations
$r_{1,iu}$ and $r_{1,it}$. The corresponding equations for the general adjustment scheme
based on (10) are not tractable.

Examples 

Advanced Placement Biology test 

The Advanced Placement Biology test contains a large number of multiple-choice
items and four constructed response items (essays) that are rated by
expert readers. The dataset, drawn from an experimental administration in
1992, comprises scores on the multiple-choice items and the four essays for 297
examinees. Each essay was rated by two readers; a different set of readers was
used for each essay. Most readers rated 24 or 25 essays in each of the two
sessions. There are a few exceptions when a reader in the first session was
apparently replaced by a different reader in the second session. The allocation
of readers to essays is almost balanced; most pairs of readers rated five responses
in common. The rating scale is 0 - 10 (integer scores).

The estimates of the variance components are given in the left-hand side of
Table 1. For all four items reader inconsistency dominates variation in severity.
In fact, for essay C the estimated severity variance is essentially zero. That 
limits the scope of adjustment. The reduction of the MSE due to adjustment is 
modest, though perceptible, for all examinees and all essays. 

Note that all conclusions related to MSEs made in this section are contingent
on the assumption that $\sigma_a^2$ (variance of the true scores), $\sigma_b^2$ (severity variance),
and $\sigma_e^2$ (inconsistency variance) are known and equal to their estimates. This
assumption is subjected to scrutiny in the next section.

The estimated mean squared errors are summarized in Table 1. For the 
unadjusted scores (column 'NAdj') the MSEs are constant across the exami- 
nees. For the adjustment schemes based on (8) ('uAdj'), (9) ('tAdj'), and (10) 
('AAdj') the ranges of the MSEs are given for each item. For instance, using 
the adjustment scheme 'AAdj' the estimated mean squared errors for item C 
are in the range 0.361 - 0.367. 
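
As a rough cross-check of the 'NAdj' column, the sketch below assumes that, under the additive model, the error of the raw mean of K ratings by distinct readers is the average of K severity terms and K inconsistency terms, so that its mean squared error is $(\sigma_b^2 + \sigma_e^2)/K$. That formula is not restated in this section and is used here as an assumption; treating the negative severity estimate for essay C as zero is likewise a choice made for this illustration. The variance estimates and reported MSEs are copied from Table 1.

```python
def mse_unadjusted(var_b, var_e, n_ratings):
    """MSE of the raw mean score, assuming independent severity and
    inconsistency terms and distinct readers for each rating."""
    return (max(var_b, 0.0) + var_e) / n_ratings  # negative estimate treated as 0


# item: (severity variance, inconsistency variance, reported NAdj) from Table 1
TABLE1 = {
    "A": (0.32, 2.42, 1.372),
    "B": (0.45, 1.40, 0.926),
    "C": (-0.01, 0.78, 0.392),
    "D": (0.72, 1.72, 1.222),
}

for item, (var_b, var_e, reported) in TABLE1.items():
    computed = mse_unadjusted(var_b, var_e, n_ratings=2)
    print(f"essay {item}: computed {computed:.3f}, reported {reported:.3f}")
```

The computed and reported values agree to within the rounding of the tabulated variances.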

For essay A the MSE for the raw mean score (no adjustment) is 1.37, the
MSEs using $u^*$ are equal to 1.32 ($u^* = 0.30$), and the MSEs using $t^*$ are equal
to 1.22. The adjustment scheme tAdj is better than uAdj for three of the
essays, although the reduction in MSE is greater than 0.1 only for essay A.
For essay A, the adjustment AAdj yields further reduction of the MSE by 0.07.
Score adjustment for essays B and D is useful but neither restrictive adjustment
scheme (uAdj or tAdj) approaches the efficiency of the general scheme (AAdj).
For essay C, gains by adjustment are marginal.

With incomplete (or no) information about the variances, the choice of the
shrinkage factor is fraught with danger, as is the more discrete choice of whether
to adjust the scores at all. For illustration of this problem Figures 2 and 3
contain plots of the MSEs for the adjusted scores using (8) and (9), as functions
of the respective coefficients $u_i$ and $t_i$. For essays A and D these functions are
within a narrow band for all responses, and therefore only the function for an
arbitrarily chosen response is plotted. For essays B and C there were responses
rated by readers with small workloads, and so their MSEs as functions of the
shrinkage coefficient differ from the rest of the responses. The MSEs for such
responses, number 7 for essay B and number 102 for essay C, are plotted together
with arbitrarily selected responses from the rest.

For essay A the choice of the shrinkage coefficient is not crucial; not much
reduction of the MSE can be achieved, although full adjustment, $u = 1$, is clearly
worse than no adjustment, $u = 0$. For a response to essay B rated by a reader
with a small workload, full adjustment is extremely risky because the reader's
mean $z_j$ is based only on one response. There is an interesting contradiction; the
minimum MSE for this response is smaller than the MSE for responses rated by 
readers with the usual workload. Better adjustment should be achieved when 
more information is available about the readers. Here this is not the case; this 
is another sign that the adjustment scheme is not fully efficient. The same 
phenomenon can be observed for essay C for which one reader with a small 
workload was engaged. 

Figure 3 contains the corresponding plots for the adjusted scores using (9) as
functions of the coefficient $t_i$. The contradiction observed for the scheme based
on (8) arises here also. The choice of the shrinkage coefficient is somewhat more
important; as in Figure 2, the largest meaningful adjustment, corresponding to
$t_i = 0.5$, is detrimental.

The example of these four essays suggests that indiscriminate use of (full)
adjustment is very risky. Two factors contributing to this outcome are relatively
small sample size (numbers of readers and their workloads) and small
variation of reader severity. Thus, without considering shrinkage estimators, no
adjustment is preferable to full adjustment. The simpler adjustment schemes
based on (8) and (9) improve estimation of the true scores somewhat, but the
scheme based on (10) is clearly superior. Note that for the simpler schemes the
mean squared error of the adjusted score is smaller than that of the raw score only in a
narrow range of the coefficients $u$ or $t$ around the optimal coefficient $u^*$ (or $t^*$).

Large differences among the MSEs for the essays suggest that combining the 
four (adjusted) scores into a single score should be done in a weighted fashion 
reflecting differential reliability of the scores. 

Studio Art Portfolio Assessment 

The Advanced Placement Studio Art test comprises a portfolio assessment in 
which artwork submitted by each examinee is rated on several criteria. Each 
of the six criteria, denoted A - F, is subjectively rated on the scale 0 - 4 by 
raters from the same pool. A score of zero is very rare; in most sessions it is 
received by less than one per cent of the 3756 examinees. The criterion A is 
rated by three different raters and the other criteria by two raters each. A rater
may assess a portfolio on several criteria, but no criterion is assessed twice by
the same rater. The workloads of the raters within criteria vary considerably;
apart from a few raters who rated fewer than four portfolios the workloads are 
in the range 120 - 600. 

Table 2 contains the variance estimates and a summary of the MSEs using
the adjustment schemes uAdj, tAdj, and AAdj. Relative to variation of the
true scores ($\sigma_a^2$), the reader inconsistency ($\sigma_e^2$) is very large. For the Advanced
Placement Biology test the inconsistency variance is less than 30 per cent of
the total variance $\sigma_a^2 + \sigma_b^2 + \sigma_e^2$ for each essay; here the inconsistency variance is
around 40 per cent for all six criteria. Here, as in the Biology test, inconsistency
variation dominates variation in severity ($\sigma_b^2$). Nevertheless, a considerable
reduction of MSE is achieved by the adjustment schemes. There is little to choose
between the schemes uAdj and tAdj, but AAdj yields a substantial additional
improvement.
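
The comparison of inconsistency shares can be reproduced from the tabulated variance estimates. The sketch below is illustrative only: the helper name is hypothetical, the values are copied from Tables 1 and 2, and only the Studio Art criteria whose estimates are fully legible in the source (A, B, C, and F) are included.

```python
def inconsistency_share(var_a, var_b, var_e):
    """Share of the inconsistency variance in the total variance."""
    return var_e / (var_a + var_b + var_e)


# (true-score, severity, inconsistency) variance estimates
BIOLOGY = {
    "A": (8.20, 0.32, 2.42),
    "B": (3.74, 0.45, 1.40),
    "C": (7.07, -0.01, 0.78),
    "D": (4.32, 0.72, 1.72),
}
STUDIO_ART = {
    "A": (0.310, 0.016, 0.199),
    "B": (0.448, 0.033, 0.220),
    "C": (0.384, 0.054, 0.222),
    "F": (0.359, 0.053, 0.279),
}

for name, table in (("Biology", BIOLOGY), ("Studio Art", STUDIO_ART)):
    for item, variances in table.items():
        share = inconsistency_share(*variances)
        print(f"{name} {item}: {100 * share:.0f} per cent")
```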

Reader inconsistency is the principal cause of low score reliability. In prin- 
ciple, the variation could be reduced by further training and instruction of the 
readers. It is instructive to consider two components of inconsistency: disagree- 
ment in the merit of the rated material and variation in the grades given by 
the same reader in a hypothetical independent replication of the rating. These
components may be reduced by training and instruction, by allowing the raters
more time, and the like. Since rating of a criterion by the same reader cannot
be replicated, one can only speculate about the relative contributions of these 
two causes to inconsistency variation. 

For illustration of model diagnostics the scores for criterion F are explored
further. Of the 33 readers who took part, nine rated a total of 15 portfolios,
and so no meaningful diagnostics for these readers can be generated. The
remaining 24 readers had workloads of 143 - 533 responses. The standardized
within-reader sample variances $S_{R,jk}/\{n_{jk}(\sigma_a^2 + \sigma_e^2)\}$ are displayed in Table 3
(first two columns) together with their aggregates over the two sessions,
$\sum_k S_{R,jk}/\{n_j(\sigma_a^2 + \sigma_e^2)\}$ (third column). The expectations of these statistics
are equal to unity, and their standard deviations to $\sqrt{2/n_{jk}}/(\sigma_a^2 + \sigma_e^2)$ and
$\sqrt{2/n_j}/(\sigma_a^2 + \sigma_e^2)$, respectively. For orientation, $\sqrt{2}/(\sigma_a^2 + \sigma_e^2) = 2.2$. Most of
the statistics in Table 3 are within their theoretical standard deviations of unity.
Also, the statistics for the two sessions are very close to one another for several
readers. Two readers, 112 and 190, stand out; their statistics are large in the
first session, but close to unity in the second. Reader 190 had a small workload
in the first session (18 responses), and so $S_{R,j1} = 1.82$ does not present strong
evidence that the reader is unusual. However, $S_{R,j1} = 1.85$ for reader 112 is a
strong indication of departure from the model because the standard deviation
associated with $S_{R,j1}$ is equal to $2.2/\sqrt{183} = 0.163$. Reader 125 has high values
of $S_{R,jk}$ in both sessions. There is little evidence that any of the readers give
the same score to almost everybody (small values of the statistics); reader
117 has the smallest value of these statistics (0.53 in session 2).

The variation in the score differences for responses rated by the same pair of
readers can be treated similarly. The within-pair statistics can be accumulated
over the readers, thus generating a $\chi^2$-type statistic for each reader. If these
statistics are aggregated within sessions, the components are independent. If
they are aggregated across sessions, they are correlated because an elementary
statistic is included for both readers. However, the correlations are small
because the severity variance is small. For the criterion F there are no outlying
readers for either session, or for the aggregates across sessions; no reader can be 
identified who is in exceptional agreement/disagreement with the fellow readers. 
For brevity, details are omitted. 

CAPA tests 

The two examples analysed above suggest that inconsistency variation should 
be the principal concern with the readers. This is not a surprising conclusion; 
severity of the readers is 'anchored' by common expectations as well as by the 
scoring rubric. Being well-acquainted with the quality of the examinees, the 
readers make sure that they do not give extreme grades to too few or to too 
many examinees. 

The English Language and Literature (ELL) and Social Science (SSc) tests 
are two components of the Content Area Performance Assessments (CAPA). 
CAPA were developed jointly by Educational Testing Service and the Califor- 
nia Commission on Teacher Credentialing (CTC). CAPA in conjunction with 
a battery of multiple-choice NTE Specialty Area tests is used for teaching
certification in California. The data analyzed here are from the November 1992
(operational) administration in which each test was taken by about 400 examinees.

The ELL and SSc tests contain two essays each, denoted ELL1, ELL2, SSc1,
and SSc2, each rated by a pair of readers. The two tests use disjoint pools of
readers, but the pools for the pair of essays within each test are identical. The
scoring scale is 1 - 6 for each essay, and the 18 fully participating readers in
each test rated between 36 and 82 responses. A small number of other readers
rated not more than four responses each.

Results of the essay scoring analysis are summarized in Table 4. The estimated
severity variances ($\sigma_b^2$) are about a sixth (ELL1) to a quarter (the other
three essays) of the estimated inconsistency variances ($\sigma_e^2$). The reduction in
MSE due to score adjustment is modest, though, because the readers' workloads
are small.

The impact of the score adjustments can be assessed by summarizing the
adjustments. Figure 4 contains the histograms of the adjustments $\hat{a}_{i,u} - y_{i,\cdot}$,
$\hat{a}_{i,t} - y_{i,\cdot}$, and $\hat{a}_i - y_{i,\cdot}$ for the respective schemes uAdj, tAdj, and AAdj. The
sample variances of these adjustments are 0.013, 0.016, and 0.027 (the sample
means are within 0.01 of zero for each scheme); the scheme AAdj has the largest
adjustments, uAdj the smallest. The adjustments for tAdj and AAdj are highly
correlated ($\rho = 0.77$), while the adjustments for uAdj have lower correlations
with both tAdj and AAdj ($\rho = 0.50$).

The adjustments are in the range -0.40 to 0.50, with 92.5 per cent of the
adjustments in the range -0.25 to 0.25; the coarseness of the rating scale remains
transparent even after any of the three kinds of adjustments. For instance,
most adjusted scores rounded to the nearest half-integer are equal to the raw
scores. The coarseness of the adjusted scores is also readily observed in the plots
of the adjusted scores against the subscores for the multiple-choice part of the
test, drawn in Figure 5. The correlations of the scores from essay ELL1 with the
multiple-choice score are in the range 0.460 - 0.468, lowest for the unadjusted
scores and highest for uAdj and AAdj. Such a small change in the correlation
could not be interpreted as an improvement in validity even if the essay and the
multiple-choice part of the test were known to have a common unidimensional
underlying trait.

Standard errors 

The variance components have an important role in estimation of the true scores
$a_i$. In all three adjustment schemes the estimated variance components are used
instead of the unknown parameter values. Therefore, it is important to establish
the sampling variation of the variance components, and the dependence of the
adjustments on the variance components. The latter is relatively straightforward
for the two restrictive schemes (uAdj and tAdj), but not for AAdj. This reduces
somewhat the efficiency of the scheme AAdj for small administrations in which
the variances are estimated subject to a lot of uncertainty.

Although feasible, derivation of the sampling variance matrix of the statistics
$S_E$, $S_R$, and $S_T$ is extremely tedious, unless $\sigma_b^2 = 0$. In any case, the variance
matrix, and therefore the distribution of the estimators of the variances $\sigma_a^2$, $\sigma_b^2$,
and $\sigma_e^2$, depends on unknown kurtoses of the random terms $a_i$, $b_j$, and $\varepsilon_{ij}$.

Since estimation of the variances is relatively simple, the standard errors for
the variances can be estimated by simulation. For a given assignment design,
say, that for essay B in the Advanced Placement Biology test, the observed
scores are replaced by those generated by the fitted model (with randomly drawn
'examinees' and 'readers'). All the random variables are drawn from normal
distributions with variances set to their estimated values from the real dataset
($\sigma_a^2 = 3.74$, $\sigma_b^2 = 0.45$, and $\sigma_e^2 = 1.40$). The effect of rounding and truncation
(to scores 0, 1, ..., 10) can be explored in the same study by reestimating the
variances with the generated (normal) scores rounded and truncated.
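
The simulation idea can be sketched as below. This is a simplified stand-in, not the procedure used for Tables 5 and 6: the actual assignment design and the moment estimators are not reproduced. Instead, each response is given to two distinct readers drawn at random from a hypothetical pool of 12, the overall mean of 5.0 is an assumed value, and only one simple statistic is tracked, the half mean squared pair difference, whose expectation under the additive model is $\sigma_b^2 + \sigma_e^2$.

```python
import random
import statistics


def simulate_pairs(n_examinees, n_readers, var_a, var_b, var_e, mean, rng):
    """One replicate: every response is scored by two distinct random readers."""
    severity = [rng.gauss(0.0, var_b ** 0.5) for _ in range(n_readers)]
    pairs = []
    for _ in range(n_examinees):
        true_score = mean + rng.gauss(0.0, var_a ** 0.5)
        j1, j2 = rng.sample(range(n_readers), 2)
        y1 = true_score + severity[j1] + rng.gauss(0.0, var_e ** 0.5)
        y2 = true_score + severity[j2] + rng.gauss(0.0, var_e ** 0.5)
        pairs.append((y1, y2))
    return pairs


def round_truncate(y, lo=0, hi=10):
    """Round to the nearest integer and truncate to the rating scale."""
    return min(hi, max(lo, round(y)))


def half_mean_sq_diff(pairs):
    """Moment estimate of var_b + var_e from pairs of ratings."""
    return sum((y1 - y2) ** 2 for y1, y2 in pairs) / (2 * len(pairs))


def replicate(n_rep=200, seed=1):
    """Mean and standard deviation of the estimate over n_rep simulated datasets."""
    rng = random.Random(seed)
    normal, rounded = [], []
    for _ in range(n_rep):
        pairs = simulate_pairs(297, 12, 3.74, 0.45, 1.40, 5.0, rng)
        normal.append(half_mean_sq_diff(pairs))
        rounded.append(half_mean_sq_diff(
            [(round_truncate(y1), round_truncate(y2)) for y1, y2 in pairs]))
    return {
        "normal_mean": statistics.fmean(normal),
        "normal_sd": statistics.stdev(normal),
        "rounded_mean": statistics.fmean(rounded),
        "rounded_sd": statistics.stdev(rounded),
    }


print(replicate())
```

The standard deviation across replicates plays the role of a simulation-based standard error; the same loop could wrap any estimator of the three variance components.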

The results are summarized in Table 5. For each variance its true value and
mean and the standard deviation of the 200 simulated estimates are given. The
estimates are evaluated for the normally generated data (left-hand side of the
table) and for their rounded and truncated versions (right-hand side).

The results for the normal scores indicate that the estimators of the variances
are unbiased. For the rounded scores the distribution of the estimators
is somewhat different, although the moments of $\hat{\sigma}_e^2$ are only slightly affected by
rounding. On average, about 13 scores (out of 297) were truncated in a session.
These scores are likely to contribute to reduction of the means of the variance
estimates. The results in Table 5 can in no way be generalized. For instance,
when the scoring rubric is coarser, as in the CAPA tests, the variance $\sigma_e^2$ is
affected. Results of the same simulation study for essay ELL1 in CAPA are
presented in Table 6, in the same format as Table 5. Now the rounding causes
a slight inflation of the variance estimator $\hat{\sigma}_e^2$. Thus, to get a rough idea of the
sampling variation of the variance estimators in simulations rounding can be
ignored. Figure 6 contains the histograms of the six sets of estimators; in the
top row the histograms for the normal scores, and in the bottom row those for
the rounded scores are drawn. The shapes of the sampling distributions of the
variance estimators are only moderately skewed.

The simulated data contain abundant information about the examinee variance
$\sigma_a^2$ and the inconsistency variance $\sigma_e^2$. However, for small administrations
estimation of the reader severity variance is clearly the Achilles heel of this
approach. In both simulations the standard deviation of $\hat{\sigma}_b^2$ is equal to about
half its mean. This is likely to erode the advantage of the adjustment schemes
over no adjustment, but probably not to the extent that no adjustment would
be preferable.

The scheme tAdj has a distinct advantage over uAdj in that it does not
depend on (an estimate of) $\sigma_b^2$. Analytical discussion of the scheme AAdj is not
feasible, but on a few examples the adjustments for tAdj and AAdj are highly
correlated, especially when the estimated severity variance is small.

Considerable improvement in estimation of $\sigma_b^2$ may be achieved by pooling
information across multiple administrations of the same test or across similar
forms of the same test when similar selection procedures, instruction, and training
of the readers are conducted. Estimation of $\sigma_b^2$ could also be improved by
embedding additional ratings in the assignment design, as proposed by Braun
(1988), although this is of lesser importance since estimates of readers' severity
coefficients are not required for true score estimation.

The simulations can also throw light on the efficiency of the sample correlation
$r$ as an estimator of the between-reader correlation $r_1 = (1 + \tau_b + \tau_e)^{-1}$.
For the simulations based on the allocation design for essay B of the Advanced
Placement Biology test the simulated value of $r_1$ is equal to $1/(1 + 1.85/3.74)
= 0.669$. The mean of the simulated estimates $(1 + \hat{\tau}_b + \hat{\tau}_e)^{-1}$ for the normal
scores is 0.665 and the mean of the sample correlations is 0.664. The sampling
standard deviations of these estimators are 0.044 and 0.045. For the rounded
scores the correlations are only marginally reduced (to around 0.65), and the
two estimators are almost equally efficient; in fact they are nearly identical. The
simulated estimates of the pairs of correlations are plotted in Figure 7 for both
normal and rounded scores. The sample correlation of the pairs of simulated
estimates of the correlations is 0.998 for both sets of scores. Thus, the sample
correlation is a good estimator of $r_1$.
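
The value 0.669 quoted above follows directly from the variance estimates for essay B. In the short sketch below the function name is hypothetical, and $\tau_b$ and $\tau_e$ are taken to be the ratios of the severity and inconsistency variances to the true-score variance, which is what the arithmetic $1/(1 + 1.85/3.74)$ implies.

```python
def between_reader_correlation(var_a, var_b, var_e):
    """Correlation of two ratings of the same response by different readers."""
    tau_b = var_b / var_a  # relative severity variance
    tau_e = var_e / var_a  # relative inconsistency variance
    return 1.0 / (1.0 + tau_b + tau_e)


# Variance estimates for essay B of the Advanced Placement Biology test.
r1 = between_reader_correlation(3.74, 0.45, 1.40)
print(round(r1, 3))  # 0.669
```

Equivalently, the correlation is the share of the true-score variance in the total variance, 3.74/5.59.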

The results of the simulations for essay ELL1 in the CAPA test lead to
similar conclusions. Details are omitted to conserve space.

Summary 

A method for decomposition of the variance of essay ratings was presented. 
It identified sources of variation due to examinees and readers (severity and 
inconsistency). Extensions of the proposed model take account of changes in 
severity and inconsistency of the readers across the sessions or due to extraneous 
factors, enable relating severity and inconsistency across items (multivariate 
models), and allow for unequal numbers of ratings of the essays. The gains in
efficiency of the adjusted scores over the unadjusted scores are modest, though
perceptible, especially for administrations in which readers have large workloads.

Information about the variance components, and severity variation in particular,
is important for adjustment of scores. Explicit equations were given
which enable detailed discussion and near-optimal choice of adjustment of the
scores. Standard errors for the estimated variances as well as for the correlations
(reliabilities) can be obtained by simulations. Simulations can also be instrumental
in deciding on the assignment design. In particular, use of readers with
small workloads should be avoided. In the studied examples reader inconsistency
dominates variation in reader severity, and therefore the MSEs of scores
can be reduced by adjustment only moderately.

The computational procedures were implemented on a Sun/Unix workstation
using the Splus 3.0 software, and the program codes developed can be obtained
from the author upon request. In practice, most of the computation can be
carried out interactively, with exception of the simulations. For instance, the
simulations for Advanced Placement Biology and the CAPA test took about
five and ten minutes of elapsed time, respectively.

Table 1: Estimates of the variance components and estimated mean squared
errors for Advanced Placement Biology test.

The acronyms 'NAdj', 'uAdj', 'tAdj', 'AAdj' stand for the mean squared errors
for unadjusted scores, and scores adjusted using (8), (9), and (10), respectively.

Advanced Placement Biology

             Variances                                     Mean squared errors
Item   $\sigma_a^2$   $\sigma_b^2$   $\sigma_e^2$     NAdj     uAdj           tAdj           AAdj
A      8.20           0.32           2.42             1.372    1.321-1.324    1.218-1.221    1.145-1.147
B      3.74           0.45           1.40             0.926    0.768-0.865    0.817-0.873    0.718-0.723
C      7.07          -0.01           0.78             0.392    0.369-0.386    0.367-0.377    0.361-0.367
D      4.32           0.72           1.72             1.222    1.127-1.128    1.081-1.081    0.924-0.924
Table 2: Estimates of the variance components for Advanced Placement Studio
Art Portfolio Assessment.

The layout and notation are the same as in Table 1.

Advanced Placement Studio Art

                  Variances                                     Mean squared errors
Criterion   $\sigma_a^2$   $\sigma_b^2$   $\sigma_e^2$     NAdj     uAdj           tAdj           AAdj
A           0.310          0.016          0.199            0.072    0.066-0.067    0.060-0.064    0.055-0.058
B           0.448          0.033          0.220            0.127    0.108-0.111    0.105-0.108    0.089-0.097
C           0.384          0.054          0.222            0.138    0.112-0.116    0.113-0.120    0.087-0.100
D           0.320          0.047          0.264            0.155    0.124-0.133    0.117-0.137    0.094-0.104
E           0.304          0.051          0.276            0.164    0.128-0.139    0.121-0.143    0.095-0.106
F           0.359          0.053          0.279            0.166    0.133-0.141    0.127-0.147    0.101-0.113
Table 3: Within-reader diagnostics for Advanced Placement Studio Art Portfolio
Assessment.

For sessions 1 and 2 the standardized versions of the statistics $S_{R,jk}$ are given.
The column 'Both sessions' contains the standardized versions (theoretical
expectation equal to unity) of these statistics pooled across the sessions. Each
statistic is accompanied by the workload on which it is based. Statistics mentioned
in the text are printed in bold.

Reader diagnostics - Studio Art Portfolio

           Session 1             Session 2             Both sessions
Reader   $\chi^2$/df   load    $\chi^2$/df   load    $\chi^2$/df   load
110      0.80          152     0.76          174     0.77          326
111      0.65          183     0.77          197     0.71          380
112      1.85          151     1.15          141     1.51          292
113      1.05          154     0.85          122     0.96          276
114      1.09          164     1.15          153     1.13          317
115      0.92          232     0.81          301     0.86          533
116      0.91          162     0.89          134     0.90          296
117      0.77          126     0.53           86     0.67          212
118      0.94          170     1.25          153     1.09          323
119      0.86          175     1.03          189     0.97          364
120      0.85          214     0.92          166     0.88          380
122      1.2?           97     0.74           46     1.11          143
123      1.42          128     1.11          124     1.28          252
124      1.39          131     1.11          197     1.27          328
125      1.86          123     1.43          114     1.68          237
126      0.94          207     0.79          224     0.86          431
127      0.84          140     1.72           96     1.30          236
128      1.24          170     1.05          183     1.17          353
129      0.80          156     0.88          214     0.85          370
180      0.85          195     0.74          177     0.80          372
181      1.03          167     0.83          119     0.94          286
182      1.04          232     1.08          205     1.05          437
183      1.03          102     1.07           90     1.06          192
190      1.82           18     1.11          143     1.19          161
Table 4: Estimates of the variance components for CAPA tests.
The layout and notation are the same as in Table 1.

             Variances                                     Mean squared errors
Item   $\sigma_a^2$   $\sigma_b^2$   $\sigma_e^2$     NAdj     uAdj           tAdj           AAdj
ELL1   0.730          0.062          0.371            0.217    0.190-0.195    0.182-0.188    0.152-0.161
ELL2   0.762          0.056          0.240            0.148    0.124-0.127    0.132-0.133    0.108-0.112
SSc1   0.707          0.077          0.321            0.199    0.166-0.173    0.170-0.171    0.136-0.143
SSc2   1.358          0.064          0.264            0.164    0.142-0.144    0.152-0.153    0.130-0.132
Table 5: Summary of the simulations using the assignment design for essay B
in Advanced Placement Biology test.

Two hundred simulations were generated. The scores were generated according
to the model in (5), with variances equal to their estimates from the real dataset.
The right-most column summarizes the numbers of scores that were smaller than
zero or greater than 10 (out of a total of 2 x 297 = 594 scores).

200 simulations, Essay B in AP Biology

                     Normal scores                                Rounded scores
             $\sigma_a^2$   $\sigma_b^2$   $\sigma_e^2$     $\sigma_a^2$   $\sigma_b^2$   $\sigma_e^2$     Trunc.
True value   3.740          0.450          1.400            3.740          0.450          1.400            n/a
Mean         3.756          0.455          1.407            3.440          0.418          1.408            26.370
St. dev.     0.421          0.261          0.194            0.371          0.241          0.188            6.788
Table 6: Summary of the simulations using the assignment design for essay
ELL1 in the CAPA test.

Two hundred simulations were generated. The scores were generated according
to the model in (5), with variances equal to their estimates from the real dataset.
The right-most column summarizes the numbers of scores that were smaller than
one or greater than 6 (out of a total of 2 x 419 = 838 scores).

200 simulations, Essay ELL1 in the CAPA test

                     Normal scores                                Rounded scores
             $\sigma_a^2$   $\sigma_b^2$   $\sigma_e^2$     $\sigma_a^2$   $\sigma_b^2$   $\sigma_e^2$     Trunc.
True value   0.730          0.062          0.371            0.730          0.062          0.371            n/a
Mean         0.749          0.077          0.350            0.722          0.074          0.421            19.955
St. dev.     0.067          0.038          0.040            0.065          0.038          0.041            5.075
Figures

1. Relationship of the correlations of two scores given to the same essay by
different readers

2. The mean squared error as a function of the shrinkage coefficient $u_i$ using
the adjustment scheme based on (8). Advanced Placement Biology test.

3. The mean squared error as a function of the shrinkage coefficient $t_i$ using
the adjustment scheme based on (9). Advanced Placement Biology test.

4. Histograms of the score adjustments for essay ELL1. The CAPA test.

5. Plots of the adjusted scores for ELL1 against the scores from the multiple-choice
part of the CAPA test.

6. Histograms of the simulated estimates of the variances using the assignment
design from the ELL1 essay in the CAPA test.

7. Plots of the estimates of the correlation based on the variance estimates,
and the sample correlation $r$.






Figure 1: Relationship of the correlations of two scores given to the same essay
by different readers, $r_1$, on the horizontal axis, and the correlation of the mean
score (K readers) with the true score, $r_a$, on the vertical axis.
The number of scores contributing to the mean, K, is marked in the plot.

Figure 2: The mean squared error as a function of the shrinkage coefficient $u_i$
using the adjustment scheme based on (8). Advanced Placement Biology test.
Each essay is represented by a plot of the function for one or two responses.
Response 7 for essay B and response 102 for essay C were rated by one reader
each with small workload. The functions for responses with the usual workload
almost coincide.

[Plot panels: Essay C, Essay D; horizontal axis: shrinkage factor]

Figure 3: The mean squared error as a function of the shrinkage coefficient $t_i$
using the adjustment scheme based on (9). Advanced Placement Biology test.
The same layout as in Figure 2 is used, and the functions for the same responses
are plotted.

Figure 4: Histograms of the score adjustments for essay ELL1. The CAPA test.

Figure 5: Plots of the unadjusted and adjusted scores for ELL1 against the
scores from the multiple-choice part of the CAPA test.

Figure 6: Histograms of the simulated estimates of the variances using the
assignment design from the ELL1 essay in the CAPA test.
The first row refers to the normal scores, the second row to the rounded scores,
the symbols (variances) 'A', 'B', and 'E' stand for $\sigma_a^2$, $\sigma_b^2$, and $\sigma_e^2$, respectively.

Figure 7: Plots of the estimates of the correlation $r_1$ based on the variance
estimates (horizontal axis), and the sample correlation $r$ (vertical axis).
Simulated data based on the allocation design for essay B of the Advanced
Placement Biology test.